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AN EFFICIENT SOLVER FOR SPARSE LINEAR SYSTEMS BASED ON 
RANK-STRUCTURED CHOLESKY FACTORIZATION 


JEFFREY N. CHADWICK AND DAVID S. BINDEL 

Abstract. Direct factorization methods for the solution of large, sparse linear systems that arise from PDE 
discretizations ai‘e robust, but typically show poor time and memory scalability for lai'ge systems. In this paper, we 
describe an efficient sparse, rank-structured Cholesky algorithm for solution of the positive definite linear system 
Ax = b when A comes from a discretized partial-differential equation. Our approach combines the efficient mem¬ 
ory access patterns of conventional supemodal Cholesky algorithms with the memory efficiency of rank-structured 
direct solvers. For several test problems arising from PDE discretizations, our method takes less memory than 
standard sparse Cholesky solvers and less wall-clock time than standard preconditioned iterations. 
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1. Introduction. We consider the problem of solving a sparse linear system 


Ax = b (1.1) 

in which A is symmetric and positive definite (SPD). In particular, we consider the Cholesky 
factorization 


A = LL^, (1.2) 

where L is a sparse lower triangular matrix; see naiiiiii. After factoring A, one solves CD 
by two triangular solves, at cost proportional to the number of nonzeros in L. This approach 
solves CD exactly up to roundoff effects, and modern supemodal factorization algorithms 
achieve high flop rates by organizing the factorization around dense matrix kernels. The chief 
drawback of sparse direct methods is that the factor L may generally have many more nonzero 
elements than A. These^l/ elements limit scalability of the method in both time and memory 
used, particularly for problems coming from the discretization of three-dimensional PDEs, 
where the number of nonzeros in L typically scales as where N is the dimension 

of A. 

Compared to direct factorization, iterative methods for generally cost less in mem¬ 
ory and in time per step than direct methods, but converge slowly without a good precondi¬ 
tioner. Preconditioning involves a complex balance between the progress in each step and 
the cost of setting up and applying the preconditioner. Even for a single preconditioner type, 
there are usually many parameters that are optimized on a problem-by-problem basis. For 
this reason, packages like PETS provide interfaces to allow users to quickly experiment 
with different preconditioners and parameter settings IT], while commercial hnite element 
codes often forego the potential benehts of iterative methods and simply use out-of-core di¬ 
rect solvers ll26l . 

Fast direct factorization methods and preconditioned iterative solvers each use a different 
types of structure. A key idea behind sparse direct methods is that one can use the structure 
of the graph associated with A to reason about about hll in L. This graph-theoretic approach 
underlies many modern sparse matrix algorithms, from methods of computing fill-reducing 
elimination orderings to “supemodal” factorization methods organized around dense matrix 
operations on columns with similar nonzero structure ||71. In contrast, to solve problems 
arising from elliptic PDE discretization efficiently, multi-level preconditioners exploit the 
elliptic regularity of the underlying differential equation. Building on ideas from fast direct 
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solvers for integral equations 113, recent work in “data-sparse” direct solvers uses both types 
of structure at once, computing a structured factorization that incorporates (approximate) 
low-rank blocks lfT4l 123 l28l fT3l . 

In this paper, we describe an efficient sparse, rank-structured Cholesky algorithm for the 
solution of when A comes from discretization of a PDE. Our method combines the 
efficient memory access patterns of conventional supernodal Cholesky algorithms with rank- 
structured direct solvers. Unlike prior solvers, our method works as a “black box” solver, 
and does not require information about an underlying PDE mesh. Eor several test problems 
arising from PDE discretizations, we show that our method takes less memory than standard 
sparse Cholesky codes and wall-clock time than standard preconditioners The remainder of 
the paper is organized as follows. In Section]^ we briefly review the standard supernodal 
left-looking sparse Cholesky algorithm on which our method is based. In the supernodal 
factorization, each supemode has an associated diagonal block storing interactions within 
that supernode, and an off-diagonal block for storing interactions between supernodes. In 
Section we describe how our algorithm forms and uses low-rank approximations to the 
off-diagonal blocks of the supernodes, and in Section]^ we describe our approach to hier¬ 
archical compression of the diagonal blocks. We discuss some key implementation details 
in Section]^ and illustrate the behavior of our algorithm on several example problems in|^ 
Einally, in Section]^ we conclude and give potential directions for future work. 

2. Background and Notation. We focus primarily on supernodal left-looking Cholesky 
factorization. This method has yielded implementations which make effective use of modern 
computing architectures to efficiently solve (O ©■ 

2.1. Supernodal Left-Looking Cholesky Factorization. Most sparse Cholesky codes 
have two phases: a fast symbolic analysis phase to compute the nonzero structure of L, 
and a more expensive numerical factorization phase in which the actual elements of L are 
computed. The symbolic analysis phase is organized around an elimination tree that encodes 
the structure of L: in general, kj f 0 precisely when there is some k such that Uik f 0 
and j is reachable from k by an elimination tree path that passes only through nodes with 
indices less than i. Often, the elimination tree has chains of sequentially-numbered nodes 
corresponding to columns with similar nonzero structure; these can be seen as supernodes in 
a coarsened version of the elimination tree. Supernodal factorization algorithms organize L 
around such supernodes, formed by collecting adjacent columns which are predicted to have 
similar non-zero patterns in L. Supernodal methods achieve high efficiency by storing the 
nonzero entries for a supernode together as a dense matrix, and by operating on that matrix 
with optimized kernels from the BLAS. 

Suppose L G is partitioned into M supemodes. Let Cj = (c : Cj < c < Cj+i) 

refer to the column indices in supernode j, and let 6^ = (c : c > Cj+i) be the list of columns 
occurring after supernode j. Einally, let Lj refer to the block column Lj = L(:, Cj). Since 
L is lower triangular, it follows that Lj(l : Cj — 1,:) = 0- We store the matrix Lj(Cj,:) 
explicitly as a dense matrix and refer to this as the supemode’s diagonal block Lj^. We also 
define tkj to be the list of nonzero rows of Lj below the diagonal block; that is, 

3?j = (fc G : 3p G Cj, 4,p 7^ 0) = • (2.1) 

Since columns in Cj have similar non-zero patterns, we store Lj (3?j,:) as a dense matrix and 
refer to this as supernode j’s (compressed) off-diagonal block L^. 

Left-looking algorithms such as the one implemented in Q form block columns Lj in 
order from left to right. We identify the descendants Dj of supernode j as follows: 

Dj = {1 < A: < _) : [Rfc n Cj 7^ 0} ; 


(2.2) 
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that is, supernodes from earlier in the factorization whose off-diagonal row set intersects the 
column set of node j. We also refer to node j as an ancestor of node k if k G Dj. For 
convenience, we also define index lists relating rows in supemode j to rows in a the off- 
diagonal block L® of a descendant supernode k: 

R^^,={l<p<\:Rk\:rlGe,), (2.3) 

R^^,=il<p<\:Rk\:rleJl,). (2.4) 

Intuitively, ( |2.3| l helps us extract the rows of that influence the contents of Lj^. Similarly, 
we use ( |2.4[ ) to extract rows of L® needed to form L®. We will also write j to denote 

the index list corresponding to R^_^j in the uncompressed structure, G Oik : p G R^^j^, 
and similarly for 01^^^. 

We also And it convenient to define the function scatterRows(B, i?i, R 2 }, where Ri C 
i ?2 are ordered index lists, and B has |i?i| rows. This function returns a matrix with |i? 2 | 
rows by placing the contents of rows of B in the output according to the positions of entries 
of Ri in i? 2 . For example. 


scatterRows 


1 2 
3 4 


,{3,8}, {2,3,5,8} 




1 

2 

0 

0 

V 3 

4 y 


(2.5) 


We similarly define the functions scatterColumns(B, Ci, C 2 ), and scatter(B, i?i, i? 2 ) Cij <^ 2 ) 
which composes scatterRows and scatterColumns. Finally, we define gatherRows as a 
function which reverses the operation of scatterRows; e.g., if scatter(B, i?i, R 2 ) = C then 
gatherRows(C, R 2 , Ri) = B. 

Forming the numerical contents of supernode j begins with assembly of a block column 
of the Schur complement; 


uP = 

'^3 


Cj) - scatter {Rk^j, :)L^ {Rk^j, 6. 


fceD,- 


( 2 . 6 ) 

Uf = A{0i„ e,) - ^ scatter (l^ :)lO 01^^^, 3l„ 6,) 

(2.7) 


fceiJ,' 


uf and uf are dense matrices with the same sizes as Lf and Lf. We note that if node k 
is a descendant of node j, then 3?^ n Cf C 3^^ . The Schur complement in node j is formed 
by first extracti ng dens e row subsets of Lf for each descendant k, then forming the matrix 
products from ( 2.6|2.7 1 using dense matrix arithmetic and Anally scattering the result to Uf 
and uf. Next, the diagonal and off-diagonal blocks of supernode j are formed as follows: 


Lf = chol (uf) (2.8) 

Lf =Uf (Lf)'^ (2.9) 

This procedure is summarized in Algorithmic 

2.2. Fill-Reducing Ordering. Figure |2T]provides an example of how a nested dissec¬ 
tion ordering might be used on a simple two-dimensional domain, and how this ordering 
influences All in the Cholesky factor of the reordered matrix. 
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Algorithm 1: factorSupernode: Computes the diagonal and off-diagonal blocks 
and for supernode j. Provided inputs are the matrix to be factored, as well as 
the partially constructed factor L. every supernode k G Dj (node j’s descendants) is 
assumed to have already been factored. 


input ; A, j, L, D^- 

output: L® 

1 begin 

2 // Initialize 

3 uf^A(e„e,) 

4 Uf^A(3^„e,) 


the 


Schur complement blocks 


5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 


for each k £ do 

// Build dense update blocks 

diagUpdate f— :) * 

offDiagUpdate ^ :) * :)^ 

// Scatter updates to the Schur complement 
Uf <— Uf - scatter(diagUpdate, Gj) 

Uf <— uf - scatter(offDiagUpdate, Gj) 

II Factor node j's diagonal block 
Lf <— cholesky(Uj^) 

// Dense triangular solve 

LO ^ U°(Lf )-^ 

return Lj^,L° 



Fig. 2.1. Fill during nested-dissection Cholesky factorization for a 2D mesh example (left). When nodes in a 
mesh are ordered so that a vertex separator appears last after the subdomains it separates (right), fill in the Cholesky 
factor (indicated by red circles) is restricted to the diagonal blocks corresponding to interactions within each sub- 
domain and within the separator and to off-diagonal blocks associated with subdomain-separator interactions. 


3. Off-Diagonal Block Compression. In (2.1 we introduced the notion of supernodes 
with dense diagonal and off-diagonal blocks hf and Jjf. In this section, we discuss the pro¬ 
cess of approximating off-diagonal blocks "Lf with low-rank matrices. The matrix "Lf stores 
interactions between supernode j, and other supernodes in occurring later in the factorization. 


3.1. Block Selection and Ordering. We use nested dissection ifTTl to construct a fill- 
reducing ordering as originally discussed in |2.2| We choose nested dissection because the 
geometric structure introduced by this method yields a factor matrix L in which many dense 
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Fig. 3.1. Nested dissection is applied to a regular, two-dimensional grid. At each level, the domain is re¬ 
cursively subdivided by the introduction of separators. Each image in the sequence depicts a level in the recursive 
dissection of the domain. Separators at the same level in the nested dissection hierarchy are rendered with the same 
color. In the rightmost figure we see that five levels of nested dissection fully decompose this domain in to subdomains 
of unit size. 


submatrices are amenable to low-rank approximation. We will discuss this property in more 
detail in p.3| Henceforth, we will assume that the matrix A has already been symmetrically 
permuted using nested dissection. We expect that the interactions between large separators 
in the factorization will exhibit rapidly decaying rank structure (see p.3l. Therefore, we 


represent each “large” separator from the nested dissection hierarchy with a supernode. The 
off-diagonal blocks associated with these supernodes describe interactions between large 
separators in the factor (see Figure [T^ . 

Standard sparse Cholesky solvers such as CHOLMOD may optionally use nested 
dissection for reordering. However, the process of constructing supemodes used by these 
solvers does not guarantee a one-to-one relationship between supemodes and separators from 
the nested dissection ordering. In our algorithm, we choose a tolerance tq € and intro¬ 
duce a supernode for every separator with at least tq variables. The remaining indices in our 
reordered matrix A are gathered in to supemodes using methods identical to 161 . 

3.2. Block Compression Scheme. Consider the state of the factorization immediately 
prior to forming the factor contents for supernode j: 


sym 

aO sym 

■^pre \0 A 

■^j -^post 


T ^ 

^pre 

T o 

^pre 



u 


UY U 


(3.1) 


post 


Here pre and post refer to the sets of columns occurring before and after supernode j, respec¬ 
tively. Note that the Schur complement VLpost is never formed explicitly since we only form 
Schur complements one supernode at a time. We also note that given the definition of Uj and 
Lj, it is necessary to apply the scatter operator to these matrices to make valid, but this 
has been omitted here for brevity. Following factorization of node j, we have; 


A^ 

pre 

A® 

pre 


sym 

sym 

Af 

sym 

AO 

-^post 


T ^ 

^pre 

T ^ 

^pre 




T ^ 

^pre 

T 

^pre 


L? 


U 


post 


(3.2) 


where 


U 


post 


= u 


post 


-U 




(3.3) 


Assuming A is positive definite, VLpost must also be positive definite. 
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Fig. 3.2. Consider the two-dimensional grid on the left, decomposed via nested dissection (see Figure \3J\ . 
If we set TQ = 5, then the three largest separators will be identified as supernodes (labelled ji, j 2 tind j^). The 
resulting matrix structure is shown on the right. The off-diagonal blocks and are shown in white and 
describe interactions between separator and separators ji and j 2 , respectively. In the domain picture on the 
left we see that, due to the way these separators intersect geometrically, these interactions tend to mostly occur 
over large distances. This justifies the use of low-rank matrices to approximate these interactions. The parts of the 
factor for which we do not apply any compression are shown in gray. These blocks can be evaluated either using the 
standard supemodal factorization algorithm, or using the interior blocks approach discussed in §5.7| 


If we approximate the off-diagonal part of supernode j with a low-rank matrix - L® « VU^ 
- then the Schur complement in ( |3.3| l is approximated by 

Upost = Upost - VU^UV^. (3.4) 

We choose V and U using a method similar to f23\ so that this modified Schur complement 
p.4| i is guaranteed to remain positive definite. Namely, we choose U to have orthonormal 
columns and V to be the projection of on to this basis; V = L^U. We can write = 

[V V][U U]^ where U is a (non-unique) matrix with orthonormal columns, U = 0, 
and V = L^U. Given these properties, we can rewrite ([33 as 

Upost = Upost - vv^ - V (3.5) 

and ( |3.4| ) as 

Upost = Upost - vv^ (3.6) 

= Upo«t+VV^. (3.7) 

_ 'J' 

Since Upost is positive definite and V V is positive semi-definite, it follows from ( |3.7| i that 
Upost remains positive definite under this approximation. 

3.3. Low-Rank Structure. Recall from §3.1 [ that we use nested dissection to reduce fill 
and represent large separators as supernodes in the factorization. Nested dissection orderings 
are built entirely based on A’s graph structure. However, in problems defined on physical 
domains (say, discretizations of partial differential equations on two- or three-dimensional 
domains) separators also have a convenient geometric interpretation. In these problems, sep¬ 
arators are geometric regions which bisect subdomains of the original problem domain (see 
Figures[TT|and[l!2]i. For example, in many three-dimensional problems the nested dissection 
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separators are surfaces which cut the domain in to disjoint pieces. Given a partial factoriza¬ 
tion of a matrix A, the remaining Schur complement TI behaves like a discretization of a 
boundary integral equation 0. For many problems, these discretizations will yield smooth 
coefficients for matrix indices which are geometrically distant from each other in the original 
problem domain. The structure of separators produced by nested dissection tends to ensure 
that the interactions between pairs of large separators occur mostly over large distances, with 
only a handful of “near-held” interactions (see Figure |3.2[ ). If supernode j corresponds to a 
large separator and IXf is this node’s off-diagonal block in the Schur complement, then we 
expect that IX^ should have rapidly decaying rank structure due to the property discussed 
above. As such, IX^ (and, likewise, L^) admits a low-rank approximation. Therefore, we 
compress the off-diagonal blocks in supernodes/separators which are sufficiently large (larger 
than To). We use the following notation for this low-rank approximation: 

Lf « VjUj where V^- s e IRie.|x9 and q < 16^1 (3.8) 


Algorithm 2: offDiagonalMultiply: Computes the product B = L® G given some in¬ 
put matrix G e > 0. The function diagonalSolve(j, X, transpose) applies 

the inverse of to the input matrix. If transpose is set to true, then diagonalSolve 
forms the product (Lj^)“^X instead. See ^and Algorithmfor a detailed descrip¬ 
tion of diagonalSolve. 


input : A, j, L, D^-, G 

output: B = L® G 

1 begin 

2 // Apply node j's diagonal inverse to the input 

3 G •(— diagonalSolve(j, G, transpose = true) 


4 

5 

6 

7 

8 

9 

10 

11 


// Multiply by the desired block from 

W ^ 

for each k e Dj do 

// Extract the required sub-matrix 
Gsub <— gatherRows(G, 6^, 

// Form the needed product with two 

n T 


T 

T 




,0 


G 


sub 


L°(i?0^,.,:)T 


A 

from G 

multiplications 


12 

13 

14 


// Scatter result to the output matrix 
W <— W — scatterRows(T, 

return W 


3.4. Compression Algorithm. Next we discuss our approach for forming the low-rank 
approximation « V^Uj. Our compression strategy must satisfy two requirements: 

1. The off-diagonal block L® may be expensive to construct and store. Therefore, we 
wish to build Vj, Uj without explicitly constructing L^. 

2. The original supernode factorization procedure presented in algorithm [T] makes ef¬ 
fective use of dense matrix arithmetic, allowing for very efficient implementations 0 
Our compression algorithm should preserve this property. 
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To satisfy these requirements, we use randomized low-rank approximation algorithms 
Similar randomized methods have been used previously in rank-structured sparse solvers Qol 
l29l . The key insight behind these randomized algorithms is that a “good” rank-g approxima¬ 
tion to a matrix B can be found by considering products of the form BG, where G is a 
randomly generated matrix with q+p columns and p > 0 is a small oversampling parameter 
(typically p « 5 — 10 is suitable). In this case, we consider a rank-g approximation to be 
good if it is close (in the 2-norm) to the best rank-g approximation provided by B’s singular 
value decomposition (S VD). If the singular values of B decay slowly, then obtaining such an 
approximation may require us to instead form products of the form C = (BB^)®BG, where 
s > 0 is a small number of power iterations. The theory behind these methods states that C 
provides a column basis for a low-rank approximation of B which is close to optimal. More¬ 
over, constructing C only requires a small number of matrix multiplications involving B and 
B^. Therefore, requirement 1 above is satished. We can also perform these multiplications 
in a way that leverages dense matrix arithmetic similar to Algorithm[T] satisfying requirement 
2 . 


Algorithm Inefficiently forums products L^G for an arbitrary dense matrix G. As dis¬ 
cussed earlier, we also require products of the form (L^)^ G. These products are formed 
by the function offDiagonalMultiplyTranspose. This function has similar structure fo Algo¬ 
rithm]^ Finally, Algorithm|^uses the ofiDiagonalMultiply and offDiagonalMultiply Transpose 
functions to form a low-rank approximation node j’s off-diagonal block, 

where Uj is chosen to have orthonormal columns (as discussed in f 3.21. We note that Algo¬ 
rithm m assumes the matrix is stored explicitly for all descendants k G Dj. In practice, 

some of these blocks may also have been assigned low-rank representations L® 

If this is the case, then we replace lines 10-11 in Algorithm|^with 


VfcU 


k ■ 


^prod ^ U/i; 

T4— Gsub 

T ^- XJprodT 

(3.9) 


4. Diagonal Block Compression. Section discussed the process of constructing a 
sparse Cholesky factorization in which off-diagonal interactions between large separators are 
approximated with low-rank matrices. In large, three-dimensional problems, larger separators 
may include thousands to tens of thousands of variables. For these problems, the compres¬ 
sion scheme from (j^can provide a significant reduction in both memory usage over standard 
factorizations, while still providing a factor that serves as an excellent preconditioner. How¬ 
ever, if supernode j is large, then building the dense diagonal matrix may also require 
significant computation and storage. In this section, we discuss an approach to compressing 
diagonal blocks L^. 

4.1. Low-Rank Structure. In ^we saw that low-rank behavior in off-diagonal blocks 
L® is exposed by the geometric structure of nested dissection. We can reorder variables 
within a separator to expose similar low-rank structure within diagonal blocks Lj^. Since all 
columns in a given supemode are treated as having the same fill pattern, we are can perform 
this reordering without affecting accuracy, memory usage, or computation time. Consider the 
top-level separator shown in hgure [TT] Suppose that the indices of vertices in this separator 
are ordered sequentially from top to bottom and that the 9x9 diagonal block for this separator 
is written as a 2 x 2 block matrix (assume, without loss of generality, that the hrst block 
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Algorithm 3: approximateOffDiagonal; Builds a low-rank approximation L® « 
VjUj for node j’s off-diagonal block. This algorithm assumes that node j’s diagonal 
block has already been factored. The function randomMatrix(m, n) generates an 
m X n matrix whose entries are drawn from a Gaussian distribution with mean 0 and 
unit variance. The function makeOrthonormal returns an orthonormal basis for the 
column space of its input matrix. This can be accomplished by means of - for example 
- a QR factorization. 


input : A, j, L, D^ , off-diagonal rank sj, number of power iterations q 
output: Yj , Uj such that 

1 begin 


V U'^ 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


// Initialize a random matrix 
G <— randomMatrix(|IR^ |, Sj) 


// Implicitly form the product L^G 
G ^— offDiagonalMultiplyTranspose(A,j, L,Dj, G) 

// Run additional power iterations 
for i = 1 to q do 

G <— offDiagonalMultiply(A, j,L,Dj, G) 

G ^— offDiagonalMultiplyTranspose(A, j, L, D^-, G) 

// Extract an orthonormal row basis for 
Uj <— makeOrthonormal(G) 

// Compute Yj by projecting on to U^- 
Yj <— offDiagonalMultiply(A,j,L,Dj,Uj) 


return V,, U, 


row/column has four entries, and that the second has five): 


D 

3 


Lii 

L21 


0 

L22 


As a result of the ordering discussed above, L 21 stores interactions between vertices in the 
top half of the separator with vertices in the bottom half. As we discussed in §3.3| the spatial 
separation between these groups of variables suggests that L 21 can be approximated with a 
low-rank matrix L 21 « VU^. We can apply this argument recursively to Ln and L 22 to 
achieve further compression. 

The example above assumes that the rows/columns of are ordered such that off- 
diagonal blocks of exhibit low-rank structure. Finding such an ordering is straightfor¬ 
ward when A comes from a PDF discretization on a regular mesh like the one pictured in 
Figure |3.1| However, obtaining a suitable ordering for general, three-dimensional problems 
on irregular domains is nontrivial. 

Recall from p.3| that separators in the nested dissection hierarchy are geometric regions 
partition the original problem domain. In three dimensions, we intuitively expect these sepa¬ 
rators to look like two-dimensional surfaces inside of the original problem domain. Our solver 
exposes low-rank structure in by reordering indices within separators so that off-diagonal 
blocks in describe interactions between spatially separated pieces of the separator region. 
We begin by assigning a three-dimensional position : i S Cj to each index associated with 
supemode/separator j. There are many techniques for spatially partitioning the positions x^. 
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Af(/i,/i) sym. 

1 

J sym. 

1 

A.f(h,h) ! Af(/2,J2) 

Af(/3UJ4,JlUJ2) 

sym. 

1 Af(/4,J3) Af{h,h) 


Fig. 4.1. We recursively partition supernode j’s variable indices Gj resulting in the tree structure shown 
on the left (assuming two levels of partitioning in this case). We permute the indices in Gj so that, following this 
permutation, the index blocks associated with leaves in this tree structure appear sequentially along the diagonal of 
A^. The resulting permutation to A® for this two-level example is shown on the right. 


Currently, we use a simple axis-based splitting scheme. We partition the positions in to 
two subsets by sorting them along the longest bounding box axis of the set {x^ : j £ C^ } and 
splitting this sorted list in to two equal-sized pieces. This process is applied recursively until 
the separator has been partitioned in to subdomains with at most td variables. The paramter 
Tp) is chosen in advance as the size of the largest diagonal block that we wish to represent 
explicitly in the factor matrix. We use this partitioning to reorder the indices within a supem- 
ode in a way that exposes low-rank structure. See Figure |4~T| for an explanation of how this 
permutation is built. Using the two-level partitioning example shown this hgure, we label 
blocks of the diagonal factor block as follows: 


- 

^3 - 


T D 

. b’l 

T ^ 

S',2 


0 

1 

+ - - 

J 

T 77 
S',3 

1 

1 



I 77 

3,5 

S,4 

T 77 

S',6 


T D 

S.7 


(4.1) 


Blocks are numbered in the order in which they must be formed during factorization (intu¬ 
itively, top to bottom and left to right). In this 4x4 example, Lj^ is be approximated as 
follows: 


D 



0 1 

1 n 



5 

' 

1 


T ! ^f,5 

0 


T 77 
S',7 


(4.2) 


For block s in this matrix, let and be the set of rows and columns over which block s 
is defined, relative to h^. That is, Similarly, let and C^g refer to 

the same row and column sets, but relative to the entire factor L, so that L(!kj^g, Cfs) = ^fs- 
In summary, we permute the original matrix A in two main steps. The first is fill-reducing 
ordering using nested dissection. As we noted in this step exposes low-rank structure in 
certain off-diagonal submatrices of L, allowing for compression. The second stage of this 
permutation consists of reordering indices within certain supernodes formed in the first stage 
- namly, those associated with large separators. This does not alter the sparsity of L, but does 
allow us to compress certain off-diagonal submatrices of these large diagonal blocks. 
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Fig. 4.2. Block row of a factor in which node j (on the right) has a hierarchically compressed diagonal matrix 
with structure given by j4.1) . Compressed matrices are indicated with a dashed border. Consider the compressed 
block L?g. When forming products with this matrix (for the purpose of compression), we require contributions from 
all columns of L with non-zeros in the region highlighted with a thick black border. Observe that this includes 
contributions from descendents Dj, as well as other compressed blocks in In this case, L^g depends on three 
of node j’s descendents, as well as 

4.2. Compression Algorithm. Next, we turn to compression of off-diagonal blocks 
within a diagonal matrix As in the compression methods discussed in 0 we use random¬ 
ized methods to construct low-rank matrix approximations. First, we define me diagonalSolve 
function, originally introduced in Algorithm We consider a slight variation on this func¬ 
tion, in which the index s of a block from ijj is provided as an argument. Invoking this 
function with argument s solves a system of equations using the smallest diagonal sub-block 
of containing Using ( |4.2| ) as an example, calling diagonalSolve with s = 2 would 
solve a system using the inverse of 


T D 


^2 (Ur^) 


\T 


T D 

S'.3 


(4.3) 


21l solves systems 
Throughout the 
as the labels of 


For brevity, calling diagonalSolve with no “s” argument (as in Algorithm 
using the entire matrix This process is summarized in Algorithm 4 
algorithms discussed in this section, we treat the indices s of blocks in 
nodes in an in-order traversal of a complete binary tree. When we refer to a child or parent 
of s, we mean the in-order index of the node which is the child or parent of the node with 
in-order index s in this tree. For example, if has 7 blocks (as in ( |4.1| i), then s = 4 is the 
root of this tree and has left and right children with indices 3 and 6, respectively. 

We will now use the diagonalSolve function to build an algorithm for compressing an 


off-diagonal block As before, we multiply hfs random matrices without explicitly 




constructing As in p.4 these operations depend on the contents of node j’s descendants 


. In addition, we may need to consider contributions from previously compressed blocks 


within LU. See Figure 4.2 for a visual representation of this dependence. 


The diagonalMultiply algorithm (Algorithm]^ provides the details of this procedure. 
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Algorithm 4: diagonalSolve: Solves a system of equations using the submatrix of 
rooted at block s in Lj^’s hierarchical structure. 

input : j, G, transpose, s 

1 begin 


2 

// Leaf nodes correspond to dense diagonal blocks 

3 

if Node s is a leaf then 

4 


if transpose then 

5 



return {LfJ-^G 

6 


else 

7 



return (Lj^J’^G 

8 

else 


9 


// Partition G into two parts 

10 


Gi = G(l:|CfJ,:) 

11 


G 2 = G{\Cj^,\ + l:end,:) 

12 


II Child blocks of s 

13 


Si = left in-order child of s 

14 


S 2 = right in-order child of s 

15 


if transpose then 

16 



// Backward substitution 

17 



Gi < — diagonalSolve(j, Gi, transpose, si) 

18 



G2 ^ G2 - ((Uj^J^Gi) 

19 



G2 < — diagonalSolve(j, G2, transpose, S 2 ) 

20 


else 

21 



II Forward substitution 

22 



G2 ^ — diagonalSolve(j, G2, transpose, S 2 ) 

23 



Gi < — Gi - ((Vj^J^Ga) 

24 



Gi < — diagonalSolve(j, Gi, transpose, si) 

25 








Lines 7-15 in this algorithm resemble the descendant multiplication from Algorithm]^ Lines 
17-35 compute contributions from other blocks inside of Lj^. As before, we also require 
the algorithm diagonalMultiplyTranspose. This algorithm has a similar structure to Algo¬ 
rithm]^ With these two functions, we define a function approximateDiagonalBlock which 
computes a low-rank representation of given some prescribed rank. The structure of this 
algorithm is not given here since it is nearly identical to Algorithm]^ 

Finally, we turn to the question of how to construct dense diagonal blocks within L. As 
in Algorithm]!] lines 7 & 10, we will consider update matrices built from off-diagonal blocks 
in node j’s descendants. In addition, we will need to consider contributions from previously 
computed low-rank blocks in . The details of this process are given in Algorithmj^ 

Given algorithms for forming diagonal and off-diagonal blocks in building this ma¬ 
trix follows a straight forward process of iterating over the blocks of in increasing order 
s = 1,2,.... At each iteration, we either form the diagonal block or a low-rank de¬ 
composition Finally, we note that, unlike the off-diagonal compression scheme 
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presented in {3.2 forming diagonal blocks and approximate off-diagonals in this order does 
not guarantee that positive definiteness is maintained throughout the factorization. We discuss 


our simple method for addressing this issue in [ 5.3 


4.3. Choosing Diagonal Block Coordinates. In S 4J]|43 we assumed that indices within 


Ijj could be reordered to expose low-rank structure in off-diagonal blocks As we dis¬ 


cussed in ^4.1 this is accomplished by assigning spatial coordinates to degrees of freedom 


within Lj. Indices are reordered such that off-diagonal blocks in Lj describe interactions 
between spatially separated “pieces” of the separator with which node j is associated. How¬ 
ever, we have not yet discussed how these spatial coordinates are determined. In many appli¬ 
cations, this information can be determined from the underlying PDE. For example, in ^we 
discuss several model problems implemented in the Deal.II finite element library. For these 
problems, spatial coordinates are determined directly from node positions in a finite element 
mesh. Unfortunately, this information may not be readily available in some cases. In the in¬ 
terest of building a general, algebraic preconditioner, we wish to also consider cases in which 
spatial coordinates for system degrees of freedom are not provided. 

Research in the area of graph visualization has yielded a variety of methods for building 
visually appealing drawings of graphs Il22l l4l. Given a sparse matrix A, we can infer geomet¬ 
ric positions for matrix indices by applying these algorithms to the graph structure implied 
A’s non-zero pattern. In this work, we appeal to spectral graph drawing algorithms, which 
build positions based on the spectral properties of certain matrices associated with the original 
system matrix A. In particular, we consider A’s Graph Laplacian L, defined as follows: 


Ly- — 


-1 

0 j, 

\{k ^ i : Aik ^ 0}\ ifi = j 


Aij ^ 0 

A„ = 0 


(4.4) 


We evaluate the three lowest-order eigenvectors vi,V 2 ,V 3 of L and associate the three- 
dimensional position [vii V 2 i Va^] with matrix index i. We find that low-accuracy approxima¬ 
tions of these eigenvectors suffice, and we evaluate these eigenvectors using Arnoldi iteration. 
See (j^for further discussion on the cost of constructing these positions. 

5. Additional Optimizations and Implementation Details. In this section we discuss 
additional optimizations for further storage reduction in our algorithm, as well as key imple¬ 
mentation details. 

5.1. Interior Blocks. The algorithm discussed in fjSjl^builds a sparse Cholesky factor 
on a matrix permuted with a nested dissection ordering. Separators in the nested dissection 
hierarchy that are sufficiently large - that is, having more than tq variables - are identified as 
supemodes in a supernodal Cholesky factorization and the diagonal and off-diagonal blocks 
for these supernodes are compressed. Supernodes with fewer than tq variables may be fac¬ 
tored using the standard supernodal sparse Cholesky algorithm; however, in this section we 
present a more memory-efficient method for handling these uncompressed blocks. 

The collection of compressed separators discussed above partitions the domain in to a 
collection of mutually disjoint subdomains (see Figure [5T| - left side), which we refer to as 
interior blocks. This remains true even for non-physical problems in which the “domain” is 
the graph defined by the sparsity pattern of the matrix to be factored. The nested dissection 
permutation guarantees that the variables in an interior block appear in a contiguous block in 
the reordered matrix (see Figure [5T| - right side). As such, an interior block can be repre¬ 
sented by a sequential list of supernode indices. For interior block i (numbered in the order 
in which it appears in the reordered matrix), we use the notation to denote the list of su¬ 
pernode indices comprising the block. The column list C® and off-diagonal row pattern tR® 
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Fig. 5.1. As in Figure [T2] we apply nested dissection to a two-dimensional grid and set tq = 5. The 
separators with 5 or more variables partition this domain in to four disjoint subdomains, which we label as interior 
blocks These four blocks are shown as regions surrounded by a dashed line in the left image. The block 

structure in L resulting from this partitioning is shown on the right. As before, solid white blocks in this matrix denote 
off-diagonal blocks for which we apply compression according to the methods in ® Dashed white blocks 
denote the off-diagonals of interior blocks These blocks are not stored explicit^ We do, however, store 

the diagonal components of interior blocks explicitly, and these entries are evaluated using standard, supernodal 
Cholesky factorization restricted to the interior block subdomain. 


associated with interior block i are defined as follows: 

ef=Ue. (5.i) 

ieBi \iGBi j 

The matrices L(C®, Cf) and L(tR®, 6®) are the diagonal and off-diagonal factor blocks for 
interior block i (the dashed/white and shaded matrix blocks in Figure [5T| respectively). In¬ 
dices within interior block i are reordered via nested dissection to guarantee that L(C®, 6®) 
is as sparse as possible. However, for our purposes we can think of interior blocks as rep¬ 
resenting the “bottom” level of the nested dissection hierarchy. When building a Cholesky 
factorization, blocks in a nested dissection hierarchy only depend on blocks from lower levels 
in this hierarchy. Therefore, the factor contents for interior block i are evaluated as follows: 

L(e®, ef) = choi(A(e®, e®)) ( 5 . 2 ) 

L(3i®, ef) = A(3i®, e®)L(e®, e®)-^ (5.3) 

We compute L(C®, C®) using a standard, uncompressed supernodal factorization. 

The matrix L(tR®, Cf) only stores interactions between the variables of interior block 
i and supernodes compressed using the methods of (J3]|^ (see Figure HD- As a result, 
L(!k®, 6®) may have many non-zero entries, making it expensive to store explicitly. Fortu¬ 
nately, we c an still approximately factor A without ever explicitly forming the block L (IR®, C®). 
In the standard supernodal factorization (Algorithm[^, we explicitly form the Schur comple¬ 
ment matrix for each supemode. Here, L(tR®, C®) must be stored explicitly because it is 
required when running Algorithm[2on ancestors of nodes in interior block i. The key insight 
of our approach is that our factorization algorithm only uses L(tR®, 6®) in places: 

1. We explicitly build Schur complements for small diagonal blocks with fewer than 
To rows/columns. This may depend on contributions from L(!R®, 6®). 
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2. We must be able to form products of the form L(3?f, e®)G and L(3?®, ef)^G where 
G is an arbitrary dense matrix. These products are required by Algorithms]^ and 
Forming the diagonal blocks referred to in the hrst requirement only necessitates the 
formation of a sub-block of L(3i®, C®) with at most tjj rows (see Algorithm]^ line 8). This 
sub-block is needed exactly once for the formation of a diagonal block. We can form small 
sub-blocks of L(3?f, Cf) as needed to build diagonal block Schur complements, then discard 
them immediately afterwards. 

We also observe that the products from the second requirement listed above can be 
formed without explicitly forming any part of L(3?f, Cf). Suppose that we wish to com¬ 
press blocks in supernode j, and that this node has some descendents in interior block z; that 
is, Dj n Bi 7 ^ 0. Forming the product L® G for some dense matrix G can be done by con¬ 
sidering each descendent in Dj individually, as is done in Algorithm]^ Alternately, we can 
consider all of the descendents in interior block i simultaneously. To do this, we first recall 
that L® = L(C®, 6®) is computed explicitly using sparse supernodal factorization, meaning 
that its inverse can be applied quickly. It follows from Algorithm that interior block z’s 
contribution to the Schur complement IX^ is given by 

L(3i,,e»)L(e„e»)^. (5.4) 


We can use (|5.2[|5.3|l to express the two matrices involved in ( |5.4| ) 


as 


L(3^„e®) = A(3^,,e®)(L®) 


-T 


L(e„e®) = A(e„e®)(L^ 


, -T 


(5.5) 


Finally, we can use ( |5.5| l to write the product of ( |5.4| l and an arbitrary dense matrix G (as 
required by Algorithm]^ as 


L(3?j , 6; 


*)L(e,,q 


)^G = A(3?„e») (L®)-^ (L«)-^[A(e„e»)^G] 


(5.6) 


As suggested by the parenthesis in ( |5.6| l, this product is the result of multiplying a sparse ma¬ 
trix with G, followed by two sparse triangular solves involving Lf, followed by another mul¬ 
tiplication with a sparse matrix. Since each of these operations can be carried out efficiently, 
this provides an effective method for forming the matrix products required by Algorithm 
without having to explicitly store blocks of the form L(!k®, C®). A similar method can be 
used to form products with blocks by replacing and Gj in ( |5.6| l (see Algorithm 
with and respectively. 

While the optimizations discussed here have the potential to signihcantly reduce storage 
requirements, this comes at the cost of somewhat more expensive factorization and triangular 


solves. We provide concrete examples of this time-memory tradeoff in ^6.3 


5.2. Estimating Rank. Up until now, we have assumed when building a low-rank ap¬ 
proximations to blocks in L that the desired rank for each block is known a priori. In this 
section, we discuss how block ranks are chosen. We use our approximate rank-structured 
Cholesky factor as a preconditioner for the Preconditioned Conjugate Gradient method. There¬ 
fore, we are also free to use simple heuristics to determine block ranks, with the understand¬ 
ing that the accuracy with which we approximate blocks will influence the effectiveness of 
our preconditioner. In principle, we could adaptively approximate blocks up to a certain 
tolerance (see, e.g., cana); however, our experiments showed that the additional cost of 
adaptive approximation outweighed the improved accuracy of preconditioners built with this 
method. Instead, we use a simple heuristic function depending only on the number of rows 
and columns in the block to be approximated. In particular, we assign the following rank to a 
block B e 


rank(B) = aVk\og 2 {k) +p 


(5.7) 
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Fig. 5.2. Interaction between two square separators with grid-based topology. Assuming that each separator 
has k variables, then the number of variables immediately adjacent to each other in the interaction between these 
two separators is s/k. 


where k = min(TO, n) and p is a small oversampling parameter. To provide a brief, intuitive 
explanation as to why this function was chosen, we consider the interaction between two 
separators in 3D space. Suppose that the separators take the shape of regular, square, two- 
dimensional grids intersecting at a right-angle (see Figure [5^ . This may be the case in, for 
instance, a PDF discretized on a regular, three-dimensional finite difference grid. Assuming 
that each separator has k variables, there are Vk in each separator which are immediately 
adjacent to the other separator. It is for this reason that we include a term proportional to y/k in 
( |5.7| l. We also scale rank(B) by log 2 (/c) since we empirically observe better preconditioning 
behavior when larger ranks are used to approximate larger blocks from L. In practice, we also 
use two different constants in ( |5.7| l - and a'^ - which determine ranks during diagonal 
and off-diagonal compression, respectively. 


5.3. Avoiding Indefinite Factorizations. As discussed originally in (4.2 our scheme 
for compressing diagonal blocks does not provide a guarantee that Schur complements 
formed during the factorization will remain positive definite. This could result in our factor¬ 
ization algorithm failing for certain inputs. Fortunately, the argument in p.2| guarantees that 
in the absence of compression of diagonal blocks Lj^, all Schur complements remain positive 
definite. This implies that we can avoid the indefinite Schur complements by approximating 
diagonal blocks with sufficient accuracy. In practice, we address this issue by adapting the 
diagonal compression parameter in the event that an indefinite diagonal matrix is encoun¬ 
tered during factorization. Specifically, we initialize ao = 0.5 and if factorization fails due 
to an indefinite matrix, we increase this constant d/j <— 1.25aD and restart the factorization 
process. This strategry increases the accuracy with which diagonal blocks are approximated 
until factorization is successful. 


6. Results. We have applied the method described in this paper to a number of sample 
problems. In §6.1| we demonstrate the behavior of our solver on a challenging nonlinear 
elasticity problem. When applied to the linear systems arising in this problem, our solver 


provides significant performance improvements over a variety of standard solvers. In (6.2 


we discuss the behavior of our solver on a variety of other sample problems. We consider 
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both Standard examples implemented using the deal.II hnite element analysis library Glia 
and examples taken from the University of Florida sparse matrix collection lO. While the 
performance differences between our solver and standard direct and iterative solvers are less 
dramatic in these examples, these results demonstrate the robustness of our solver. 

6.1. An Example: A Nonlinear Elasticity Problem. In this section, we discuss an 
example that illustrates the behavior of our current solver. While we have tested our solver 
on numerous problems, the problem described here poses particular difficulty for standard 
iterative methods. As such, it is an ideal candidate for a hybrid approach such as ours which 
leverages the reliability of direct solvers with the low memory overhead of iterative methods. 

We evaluate our problem using a benchmark problem taken from based on a standard 

example from the deal.11 hnite element analysis library EE). This simulation models quasi¬ 
static loading of a nearly-incompressible, hyperelastic block under compression. The code 
applies a force incrementally over two 
load steps; at each load step, a nonlin¬ 
ear system of equations is solved to de¬ 
termine the resulting deformation of the 
block. The nonlinear system is solved 
by a Newton iteration, and we evaluate 
the performance of our solver for solv¬ 
ing the sequence of linear systems that 
arise during this process. 

Because this problem is nearly in¬ 
compressible, standard displacement- 
based elements would be prone to lock¬ 
ing. Consequently, our test problem uses 
a mixed formulation with explicit pres¬ 
sure and dilation held variables in ad¬ 
dition to the displacement helds. We F\G. 6.1. High-resolution (N = 80), elastic Mock under 

consider two versions of this problem — compression. Our sparse rank-structured preconditioner was 
henceforth referred to as the p = 1 and deformations seen here, 

p = 2 problems. In the p = 1 problem, displacements are discretized with continuous lin¬ 
ear Lagrange brick elements, while pressure and dilation are discretized using discontinuous 
piecewise constant functions. The p = 2 problem discretizes displacements with quadratic 
elements and uses discontinuous linear elements for pressure and dilation. All variables are 
discretized on an iV x x element grid. In all problem instances, the pressure and di¬ 
lation variables are condensed out prior to the linear solve, so the system we solve involves 
only displacement variables. 



6.1.1. Comparison to Standard Iterative Solvers. We have solved the benchmark 
problem with the preconditioned conjugate gradient (PCG) iteration using our rank-structured 
Cholesky preconditioner and several other preconditioners provided in Trilinos ifTTl l20l ItTI 
num, a library of high-performance solvers developed primarily at Sandia national labs. 
Our code consistently out-performed a Jacobi preconditioner, an incomplete Cholesky (ICC) 
preconditioner, and a multi-level (ML) preconditioner in both iteration counts and wall clock 
time (Figure 6.2(a) -672(b)|l. Our timing results are summarized in Table 6.1 


All results repotted in this section were generated on an 8-core Intel Xeon X5570 work¬ 
station with 48GB of memory running Ubuntu 12.04, with LAPACK and BLAS implemen¬ 
tations provided by the Intel Math Kernel Library version 11.0. We use the preconditioned 
conjugate gradient (PCG) implementation provided by AztecOO for all tests. All linear sys¬ 
tems were solved to a relative £2 residual error threshold of 10“®. At this accuracy level, the 
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Seconds '10^ 


Fig. 6.2. PCG convergence for the first linear solve in the p = 1 benchmark problem with N = 50. Jacobi, 
ICC, ML and RSC refer to solves preconditioned with Jacobi, incomplete Cholesky (IFPACK), multigrid and rank- 
structured Cholesky (our solver) preconditioners, respectively. Convergence curves relative to wall clock time start 
att>0 due to time required to construct the preconditioner. 


nonlinear iteration required 14-15 linear solve steps to converge (as compared to 12 linear 
solves for a standard Cholesky solver). Due to the time required to solve linear systems using 
the standard preconditioners, we only ran these example for N < 50. For the rank-structured 
Cholesky solver, we ran examples up to iV < 80. 

We observe the following properties for the different preconditioners for this problem: 
Jacobi: The Jacobi preconditioner (diagonal preconditioner) is simple, but it usually only 
modestly accelerates convergence. When solving the p — 1 problem, more iterations are 
required to converge with the Jacobi preconditioner than with more sophisticated precondi¬ 
tioners. However, each iteration is so cheap that this process requires less wall clock time than 
solves performed with other standard preconditioners. Meanwhile, in the p = 2 problem, the 
number of iterations required when using a Jacobi preconditioner grows considerably, making 
other standard preconditioners more competitive. 

ICC: We timed the PCG iteration using incomplete Cholesky (ICC) preconditioners imple¬ 
mented in both IFPACK and AztecOO. As with most incomplete factorization codes, these 
solvers require several parameters, including the level of fill allowed in the factorization, drop 
tolerances dictating which matrix entries should be discarded, and parameters controlling 
perturbations to the matrix’s diagonal. The latter are required to avoid poorly conditioned 
factorizations since solvers based on incomplete factorizations appear to encounter severe 
conditioning issues when applied to this problem. In general, “good” parameter choices de¬ 
pend on the problem. We chose parameters for this problem based on experiments with a 
small problem instance (e.g., N = 20 for the p = 1 problem). When solving the p = 1 prob¬ 
lem, even with the tuned parameters, we require almost as many iterations with IFPACK’s 
ICC preconditioner as with the much simpler Jacobi preconditioner. Moreover, the incom¬ 
plete Cholesky preconditioner costs more than applying the Jacobi preconditioner, so the 
overall time to solve the linear systems is actually larger in this case. Meanwhile, we find 
that Aztec’s ICC preconditioner is unable to make any progress towards convergence in the 
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Seconds '10^ 


Fig. 6.3. PCG convergence for the first linear solve in the p = 2 benchmark problem with N = 35 (see the 
caption for Figure ^.2ia)\6.2(Fj\ . Note that the Jacobi solve was run to convergence: the left plot is truncated for 
clarity. 


p — 1 problem. We observe the opposite behavior in the p = 2 problem. Specifically, IF- 
PACK’s ICC preconditioner makes no progress towards convergence, whereas the AztecOO 
preconditioner performs reasonably well (but still significantly slower than our rank struc¬ 
tured solver). In fact, while IFPACK’s ICC solver was the least effective standard solver (in 
terms of wall clock time) applied to the p = 1 problem, AztecOO’s ICC solver was the most 
effective standard solver applied to the p = 2 problem. This phenomenon provides further 
evidence of the difficulty associated with choosing an effective preconditioner for a given 
problem. 

ML: While multigrid preconditioners perform well on many problems, on our benchmark 
we see relatively poor convergence and long solve times. We use an algebraic multigrid 
preconditioner that solves the problem at its coarsest level using a direct solver provided by 
Amesos. The next coarsest level applies a symmetric Gauss-Seidel smoother over several 
sweeps (4 in this case). Finer levels use a degree-2 Chebyshev polynomial smoother. These 
parameters were chosen based on good convergence behavior on a small problem instance 
(e.g., N — 20 for the p = I problem). When solving the p = 1 problem, this solver requires 
significantly fewer iterations than the Jacobi or incomplete Cholesky solvers; but because 
applying the preconditioner is relatively expensive, it takes about as long to solve with the 
multigrid preconditioner as with a Jacobi preconditioner. When applied to the p = 2 problem, 
the multigrid preconditioner requires significantly fewer iterations than the Jacobi solver, but 
many more than the AztecOO incomplete Cholesky solver. As a result, this solver is the least 
effective standard solver that we tested on the p = 2 problem. We also note here that multigrid 
frameworks such as the one provided by ML require tuning a wide variety of parameters 
(some of which are discussed above). In some cases, tuning problem-specific parameters to 
achieve good convergence behavior may outweigh the cost of solving the problem with a 
simpler method. 

Rank-Structured Cholesky: The conjugate gradient method preconditioned with our rank- 
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Structured Cholesky solver converges quickly, both in terms of the iteration count and in terms 
of wall clock time. This is a signihcant improvement over the other preconditioners. 




Fig. 6.4. Mean times and memory usage for the p = 1 benchmark problem with problem between N = 2{) and 
N = 80. Solvers based on direct factorization are much faster than the standard preconditioners in this benchmark. 
Although it takes somewhat longer to solve systems with our rank-structured solver than with an exact factorization, 
the memory requirements of the latter approach make it infeasible for larger problems. 


6.1.2. Comparison to Exact Factorization. Beside comparing to standard precondi¬ 
tioners, we also compare our code to an exact sparse Cholesky factorization. As discussed 
above, our sparse Cholesky implementation closely mirrors CHOLMOD 0 and achieves 
similar performance and memory usage for exact factorizations. Because CHOLMOD places 
restrictions on problem size, we use our code for both the rank-structured approximation and 
the exact Cholesky factorizations 


In Figure 6.4 we show how much time and memory we need to solve linear systems 
with the rank-structured and exact sparse Cholesky factorizations. For this benchmark, both 
the rank-structured and the exact Cholesky solvers are much faster than the standard precon¬ 
ditioned iterations. The rank-structured Cholesky solver is somewhat slower than the exact 
Cholesky solver; but the memory requirements of the latter approach make it infeasible for 
larger problems. 


6.2. Other Sample Problems. In this section, we discuss other sample problems arising 
either from hnite element discretizations in Deal.II or from the University of Florida Sparse 
Matrix collection. For some examples, we also consider the effect of varying the number 


of power iterations used for low-rank approximation in our solver (see t3.4i. We hnd that 


increasing this number can result in somewhat more accurate approximation, and a modest 
reduction in PCG iterations. 


Finite eiement anaiysis of a trabecular bone: We consider the stiffness matrix produced 
by hnite element analysis of a three-dimensional trabecular bone model. The matrix used 
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in this problem is provided in the University of Florida sparse matrix collection has di¬ 
mension n = 986703 and has 24419243 non-zeros in it’s lower triangular component. We 
use low-accuracy eigenvectors of the matrix’s graph Laplacian to compute three-dimensional 
coordinates for degrees of freedom in this system (see (4.3 i. Since no right-hand-side vec¬ 
tor is provided for this problem, we solve the system Ax = b with the constant vector 
b = (1 1 ... 1)"^. 



Iterations ’10^ 



Seconds '10^ 


Fig. 6.5. PCG convergence for the trabecular hone problem. We compare results using incomplete Cholesky 
(ICC), a multigrid preconditioner (ML), and our solver using s = 1 or 2 power iterations when building low-rank 
approximations (RSCI and RSC2, respectively). We see that the improved accuracy from more power iterations 
results in a modest reduction in solution time. 


Finite eiement analysis of a steei fiange: Here, we consider a linear system arising from a 
three-dimensional mechanical problem discretizing a steel flange. This example can be found 
in the University of Florida sparse matrix collection]^ has dimension n = 1564794 and has 
57865083 non-zeros in it’s lower triangular component. 

Poisson’s equation: We apply our solver to Poisson’s equation 

-V-K(x)Vp = /inU (6.1) 

p = g on on (6.2) 

where the coefficients K(x) are optionally both inhomogeneous and anisotropic. The basic 
setup of this problem follows a standard example from the deal.ll library In particular, 
rather than solving (|6.1|6.2|l directly, we define u = — KVp and consider the mixed fonnu- 


'http://www.else.uf1.edu/research/sparse/matrices/Oberwolfach/boneOlO.html 
"^http: // www. else . uf 1. edu/research /sparse/matrices/ Janna/Flan_1565 .html 
'http://WWW.dealii.org/developer/doxygen/deal.11/step_20.html 





















Fig. 6.6. PCG convergence for the steel flange problem. 


lotion of this problem; 


K~^u + Vp = 0 in 0 
—V • u = —/ in n 
p = g on dfl 


(6.3) 

(6.4) 

(6.5) 


This problem is discretized using Raviart-Thomas elements, resulting in a linear system of 
the form 



Block elimination of ( |6.6| l yields the following block system: 

SP = BM-if-g (6.7) 

MU = f - B^P (6.8) 

where S is the positive definite Schur complement matrix S = BM~^B. ( |6.7[ ) can be solved 
via PCG given an efficient procedure for forming matrix-vector products with S. This in turn 
requires efficient application of M~^. This will also be accomplished with the conjugate 
gradient method. In the case K(x) = const, linear systems involving M turn out to be quite 
easy to solve. In fact, these systems can be solved quickly using standard conjugate gradients 
with no preconditioning. Nevertheless, this problem provides a useful benchmark for our 
solver. We also consider versions of the problem in which K(x) is anisotropic and highly 
inhomogeneous to demonstrate that our solver can also handle these cases. 

6.3. Comparisons. In this section we provide timing and memory usage statistics for 
some of the features discussed in (|3]|^ 
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Diagonal block coordinates: In §4.3| we discussed how spatial coordinates for diagonal 
block indices can be chosen either geometrically based on information obtained directly from 
a problem’s discretization, or algebraically via approximations of the low-order eigenvectors 
associated with the problem’s graph Laplacian. We compare both approaches applied to a 
linear system taken from the nonlinear elasticity benchmark problem ( §6.1| i. We find that both 
approaches produce preconditioners which converge in a comparable number of iterations. 
Recall that these spatial coordinates are used to permute indices within compressed diagonal 
blocks. We also consider results when randomly permuting these indices to demonstrate the 
need for an effective permutation. In this case we observe a significant increase in the number 
of required PCG iterations. We also note that 
the diagonal rank constant tu had to be increased 
several times in this case to allow for success¬ 
ful factorization (see Q. Convergence plots 
for these three diagonal block orderings are pro¬ 
vided in Figure |6.7| Finally, it is worth not¬ 
ing that the ordering provided by the original 
problem discretization and nested dissection order¬ 
ing may be sufficient for diagonal block compres¬ 
sion. For the problem considered here we find 
that we observe little difference in the PCG con¬ 
vergence behavior even when no additional per¬ 
mutations are applied to diagonal block indices. 

However, it should be noted that this is not 
guaranteed to be the case, and a poor choice 
of diagonal coordinates can lead to poor perfor¬ 
mance (see the random reordering result in Fig- 

I . Fig. 6.7. PCG convergence using ran- 

ure |0. /p . diagonal block reordering, as well as re¬ 

ordering based on geometric index positions 

Interior block performance: In jjSHjwe discussed fCeo.j and positions obtained from a low- 
how to avoid building and storing certain factor blocks^£xpS:f(^”fo^¥u^f}fei‘ reduce mem¬ 
ory usage. While this method reduces storage requirements, it comes at the cost of increased 
factorization and triangular solve time. Here we compare memory usage and solution time 
results for two examples computed with and without this optimization. 



7. Conclusions. In this paper, we have described a direct factorization method for the 
solution of large sparse linear systems that arise from PDF discretizations. Like standard 
direct solvers, our approach is black box, and can work with the pre-assembled matrix without 
prior information about details of an underlying mesh or a specific discretization method. By 
taking advantage of the low-rank block structure that arises from the underlying PDE, our 
method requires significantly less memory than standard direct methods, but through careful 
code organization we retain the high performance of standard direct solvers through use of 
level-3 BLAS and LAPACK calls. We have demonstrated through examples that our approach 
retains much of the robustness of standard direct solvers, and yields a faster time to solution 
than the standard multilevel algebraic multigrid preconditioner ML. 

Limitations and Future Work: So far, our work has focused solely on symmetric and pos¬ 
itive definite matrices. Other authors have showed how to deal with indefinite problems, 
with a particular focus on Helmholtz equations, and we intend to adapt that work along with 
standard static pivoting approaches developed in the context of ordinary sparse parallel LU 
decompositions. We also so far only have limited parallelism through threaded BLAS calls, 
but intend to extend our code to work in a distributed memory setting in the future. 
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Algorithm 5: diagonalMultiply: Given an input matrix G, forms the product 
where is an off-diagonal block in . This is done without forming ex¬ 
plicitly. The alignSet function performs the following operation: alignSet(5'i, S 2 ) = 
{i — min(S' 2 ) -f 1 : i G S'!}. We will use this function to express row and column sets 
for block s relative to other blocks in the hierarchy. 


input : j, G, transpose, s 

output: 

1 begin 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 


Apply necessary diagonal block inverse to G 

Si — left child of s 

G ^— diagonalSolve(j, G, transpose = true, ai) 

// Initialize a workspace for multiplication 

w ^ 

II Apply contributions from supernode descendants 

for each k G Dj do 

II Extract the required sub-matrix from G 
^ — gatherRows(G, n 

II Form the needed product with two multiplications 

1 T 


T 

T 




II Scatter result to the output matrix 

W ^— W — scatterRows(T, H 


16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 


// Apply contributions from other blocks in 
p = parent of s 
while p is not null do 

// Block p contributes to s only if it appears 
// earlier in the ordering 

if p < s then 

// Get necessary row and column ranges from p 
II This is valid since C-Rand 
Rsub = alignSet Rf^) 

Csub = alignSet 


// Perform multiplication 
G prod 

T <— 

T <- GprodT! 

T 




D 

j^P 

^G 


VfjRsub,:) 


similar to 


32 

33 

34 

35 

36 


Accumulate result in workspace 

W-T 

Continue moving up the block hierarchy 
p i — parent of p 

return W 
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Algorithm 6: factorDiagonal: Builds a factored diagonal block within . s is 
assumed to be the index of a diagonal block. We introduce two new pieces of notation 
here: , = {1 < p < € CfJ and , = {r^ € 51, : rl G CfJ. 

That is, these are rows from descendant k which are relevant to the formation of 
diagonal block s within . 


input : j, s 
output: 

1 begin 

2 // Initialize Schur complement with matrix contents. 

3 // This is a diagonal block, so = 

5 // Accumulate contributions from descendants. 

6 for each k e Dj do 

7 // Build dense update block 

8 diagUpdate ^ Lf ,, :)Lf ,, :)^ 

9 // Scatter updates to the Schur complement 


Ujg <— Ujg - scatter(diagUpdate, 


D (dD mD pD \ 


// Accumulate contributions from previous blocks 
// in Lf. 
p < — parent of s 
while p is not null do 

// Block p contributes to s only if it appears 
earlier 

// in the ordering 

if p < s then 

// Get necessary row and column ranges from p 
// This is valid since C and C Rf^ 
Rsub = alignSet Rfp) 

Csub = alignSet{CRfp) 

// Build a dense update matrix 


^ prod 


T ^— 

Vprod[V^^p{Csub, :)]^ 

T <— 

Y!^^p{Rsub,:)T 


// Subtract update from Schur complement 
i — - T 

11 Factor node j's diagonal block 
Lys < — CholeskyCUj^g) 


return 
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N 

20 

30 

40 

50 

60 

70 

80 


n 

27783 

89373 

206763 

397953 

680943 

1073733 

1594323 


Total time (s) 

194 

1170 

4065 

10520 





Mean time (s) 

14 

84 

271 

701 




Jacobi 

Total iterations 

49579 

87957 

133361 

170750 





Mean iterations 

3541 

6282 

8890 

11383 





Total time (s) 

207 

1334 

5159 

13540 




ICC 

Mean time (s) 

15 

95 

344 

903 




(IFPACK) 

Total iterations 

48637 

87244 

132981 

168172 





Mean iterations 

3474 

6231 

8865 

12110 





Total time (s) 

270 

1447 

4763 

11360 




\/TT 

Mean time (s) 

19 

103 

318 

757 




iVlL 

Total iterations 

5949 

10170 

15249 

19109 





Mean iterations 

425 

726 

1017 

1274 





Total time (s) 

32 

173 

618 

1470 

2862 

6174 

12350 

p Qr’ 

Mean time (s) 

2.3 

12 

41 

98 

191 

412 

823 

JXkj V-- 

Total iterations 

349 

615 

1209 

1514 

1729 

2278 

4050 


Mean iterations 

24 

41 

80 

100 

115 

151 

270 


Table 6.1 

Nonlinear elasticity (p = 1) performance results: Relative performance of standard iterative methods (Ja¬ 
cobi, incomplete Cholesky, and multigrid) compared to rank-structured Cholesky. 



N 

10 

15 

20 

25 

30 

35 

40 

45 


n 

27783 

89373 

206763 

397953 

680943 

1073733 

1594323 



Total time (s) 

501 

3731 

13610 

33920 





T 

Mean time (s) 

36 

267 

907 

2261 





J acobi 

Total iterations 

60289 

110971 

175525 

231833 






Mean iterations 

4306 

7926 

11701 

15455 






Total time (s) 

61 

287 

955 

2473 

5097 

10150 



ICC 

Mean time (s) 

4.4 

20 

64 

165 

340 

677 



(Aztec) 

Total iterations 

897 

1590 

2689 

3914 

5341 

7030 




Mean iterations 

64 

113 

179 

260 

356 

468 




Total time (s) 

918 

4431 

13920 

34950 

73690 

144900 

227800 


IV/TT 

Mean time (s) 

66 

317 

928 

2330 

4913 

9660 

15187 


iVlL 

Total iterations 

9149 

13770 

20748 

26534 

32292 

40126 

43509 



Mean iterations 

653 

983 

1383 

1768 

2152 

2675 

2900 



Total time (s) 

42 

200 

616 

1460 

2844 

5139 

9873 

14900 


Mean time (s) 

3 

13 

41 

97 

190 

668 

658 

993 

Iv J V- 

Total iterations 

121 

196 

343 

438 

497 

668 

1438 

1429 


Mean iterations 

8 

13 

22 

29 

33 

44 

95 

95 


Table 6.2 

Nonlinear elasticity (p = 2) performance results: Relative performance of standard iterative methods (Ja¬ 
cobi, incomplete Cholesky, and multigrid) compared to rank-structured Cholesky. 
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Symbol 


Meaning 


L. 

L?,L? 


Uf,Uf 


pD pO 


rnD cpO 
'^k^j^'^k^j 


cpD 


r>D 

■^/e— "’'^k^j,s 


v„u, 


pD 


rpD pD 


System matrix (generally assumed to have been permuted with fill- 
reducing ordering). 

Cholesky factor matrix (of permuted system). 

First and last column in supernode j. 

Supernode j’s column set. 

Set of columns occurring after supernode j. 

Set of non-zero rows in supemode j’s off-diagonal. 

Supernode j’s block column in L. 

Diagonal and off-diagonal blocks of Lj, respectively. 6 jrICjIxIGjI 


and Ly* g 


x|ed 


Schur complement matrices corresponding to and L® 
Indices of supernode descendants of supernode 

{1 < fc < J : Ikfc n Cj ^ 0}. 

Recalling that Jlj, = {rl,rl,...}, = {1 < P < |3^fc 


e e,} 


3 i 

in 


and = {1 < p < € Ikj}. That is the set of rows i 

node k needed to construct and L®, respectively (relative to the row 
space of L^). 

Similar to the definitions above, = {r^ 6 Ikfc : g 6^} and 

~ i^k ^ ^ rows in node k 

needed to construct and L®, respectively (relative to the full row 
space of L). 

Similar to and , = {1 < P < \^k\ ■ rl g Gf J 

and s = € fkfc : g Cj^s}. That is, these are rows from 

descendant k which are relevant to the formation of diagonal block s 


within hj. 

Low-rank representation for supernode j’s off-diagonal block (Lf sa 

V,UJ). 

Block s from the diagonal matrix Lj^. 

Low-rank representation for block s of . 

Row and column sets over which Lf. is defined, relative to the dense 

J)* 

matrix Lj^. 

Row and column sets over which Lf*. is defined, relative to the full 

3i^ 


system A (or L). 


Table 7.1 

List of symbols 




